Abstract



We introduce a new family of matrix norms, the "local max" norms, generalizing
existing methods such as the max norm, the trace norm (nuclear norm), and the 
weighted or smoothed weighted trace norms, which have been extensively used in 
the literature as regularizers for matrix reconstruction problems. We show that this 
new family can be used to interpolate between the (weighted or unweighted) trace 
norm and the more conservative max norm. We test this interpolation on simulated 
data and on the large-scale Netflix and MovieLens ratings data, and find improved 
accuracy relative to the existing matrix norms. We also provide theoretical results 
showing learning guarantees for some of the new norms. 

1 Introduction 

In the matrix reconstruction problem, we are given a matrix $Y \in \mathbb{R}^{n \times m}$ whose entries are only partly
observed, and would like to reconstruct the unobserved entries as accurately as possible. Matrix 
reconstruction arises in many modern applications, including the areas of collaborative filtering 
(e.g. the Netflix prize), image and video data, and others. This problem has often been approached 
using regularization with matrix norms that promote low-rank or approximately-low-rank solutions, 
including the trace norm (also known as the nuclear norm) and the max norm, as well as several 
adaptations of the trace norm described below. 

In this paper, we introduce a unifying family of norms that generalizes these existing matrix norms, 
and that can be used to interpolate between the trace and max norms. We show that this family 
includes new norms, lying strictly between the trace and max norms, that give empirical and theoretical
improvements over the existing norms. We give results allowing for large-scale optimization
with norms from the new family. Some proofs are deferred to the Supplementary Materials. 

Notation. Without loss of generality we take $n \geq m$. We let $\mathbb{R}_+$ denote the nonnegative real
numbers. For any $n \in \mathbb{N}$, let $[n] = \{1, \dots, n\}$, and define the simplex on $[n]$ as
$\Delta_{[n]} = \{r \in \mathbb{R}^n_+ : \sum_i r_i = 1\}$. We analyze situations where the locations of observed entries are sampled
i.i.d. according to some distribution $p$ on $[n] \times [m]$. We write $p_{i\bullet} = \sum_j p_{ij}$ to denote the marginal
probability of row $i$, and $p_{\mathrm{row}} = (p_{1\bullet}, \dots, p_{n\bullet}) \in \Delta_{[n]}$ to denote the marginal row distribution.
We define $p_{\bullet j}$ and $p_{\mathrm{col}}$ similarly for the columns.

1.1 Trace norm and max norm 

A common regularizer used in matrix reconstruction, and other matrix problems, is the trace norm
$\|X\|_{\mathrm{tr}}$, equal to the sum of the singular values of $X$. This norm can also be defined via a factorization
of $X$:

$$\frac{1}{\sqrt{nm}} \|X\|_{\mathrm{tr}} = \frac{1}{2} \min_{AB^\top = X} \left( \frac{1}{n} \sum_i \|A_{(i)}\|_2^2 + \frac{1}{m} \sum_j \|B_{(j)}\|_2^2 \right) , \qquad (1)$$

where $M_{(i)}$ denotes the $i$th row of a matrix $M$, and where the minimum is taken over factorizations
of $X$ of arbitrary dimension; that is, the number of columns in $A$ and $B$ is unbounded. Note that we
choose to scale the trace norm by $1/\sqrt{nm}$ in order to emphasize that we are averaging the squared
row norms of $A$ and $B$.
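As a numerical illustration of the factorized definition of the rescaled trace norm, the following minimal sketch builds a balanced factorization from the singular value decomposition and compares it with another valid factorization. The test matrix, its dimensions and the helper name `objective` are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
X = rng.standard_normal((n, m))  # arbitrary test matrix (assumption)

# Thin SVD: X = U diag(d) V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Balanced factorization X = A B^T. The (n/m)^(1/4) rescaling equalizes the
# two averaged terms and cancels in the product A B^T.
scale = (n / m) ** 0.25
A = scale * U * np.sqrt(d)
B = Vt.T * np.sqrt(d) / scale

def objective(A, B):
    """Half the sum of the average squared row norms of A and B."""
    return 0.5 * ((A ** 2).sum() / A.shape[0] + (B ** 2).sum() / B.shape[0])

rescaled_trace_norm = d.sum() / np.sqrt(n * m)
value_at_svd = objective(A, B)

# Any other factorization X = (A G)(B G^{-T})^T should do no better.
G = rng.standard_normal((len(d), len(d))) + 3.0 * np.eye(len(d))
A2, B2 = A @ G, B @ np.linalg.inv(G).T
value_at_other = objective(A2, B2)
```

Here `value_at_svd` should match `rescaled_trace_norm` up to rounding, while `value_at_other` should be at least as large.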

Regularization with the trace norm gives good theoretical and empirical results, as long as the locations
of observed entries are sampled uniformly (i.e. when $p$ is the uniform distribution on $[n] \times [m]$),
and, under this assumption, can also be used to guarantee approximate recovery of an underlying 
low -rank matrix Cj ^ E gl . 

The factorized definition of the trace norm (1) allows for an intuitive comparison with the max norm,
defined as [1]:

$$\|X\|_{\max} = \frac{1}{2} \min_{AB^\top = X} \left( \sup_i \|A_{(i)}\|_2^2 + \sup_j \|B_{(j)}\|_2^2 \right) . \qquad (2)$$

We see that the max norm measures the largest row norms in the factorization, while the rescaled 
trace norm instead considers the average row norms. The max norm is therefore an upper bound 
on the rescaled trace norm, and can be viewed as a more conservative regularizer. For the more
general setting where $p$ may not be uniform, Foygel and Srebro [4] show that the max norm is still
an effective regularizer (in particular, bounds on error for the max norm are not affected by $p$). On
the other hand, Salakhutdinov and Srebro [5] show that the trace norm is not robust to non-uniform
sampling: regularizing with the trace norm may yield large error due to over-fitting on the rows and
columns with high marginals. They obtain improved empirical results by placing more penalization 
on these over-represented rows and columns, described next. 

1.2 The weighted trace norm 

To reduce overfitting on the rows and columns with high marginal probabilities under the distribution 
$p$, Salakhutdinov and Srebro propose regularizing with the $p$-weighted trace norm,

$$\|X\|_{\mathrm{tr}(p)} := \left\| \mathrm{diag}(p_{\mathrm{row}})^{1/2} \cdot X \cdot \mathrm{diag}(p_{\mathrm{col}})^{1/2} \right\|_{\mathrm{tr}} .$$

If the row and the column of entries to be observed are sampled independently (i.e. $p = p_{\mathrm{row}} \times p_{\mathrm{col}}$
is a product distribution), then the $p$-weighted trace norm can be used to obtain good learning
guarantees even when $p_{\mathrm{row}}$ and $p_{\mathrm{col}}$ are non-uniform [5, 6]. However, for non-uniform non-product
sampling distributions, even the $p$-weighted trace norm can yield poor generalization performance.
To correct for this, Foygel et al. [6] suggest adding in some "smoothing" to avoid under-penalizing
the rows and columns with low marginal probabilities, and obtain improved empirical and theoretical
results for matrix reconstruction using the smoothed weighted trace norm:



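$$\|X\|_{\mathrm{tr}(\tilde{p})} := \left\| \mathrm{diag}(\tilde{p}_{\mathrm{row}})^{1/2} \cdot X \cdot \mathrm{diag}(\tilde{p}_{\mathrm{col}})^{1/2} \right\|_{\mathrm{tr}} ,$$

where $\tilde{p}_{\mathrm{row}}$ and $\tilde{p}_{\mathrm{col}}$ denote smoothed row and column marginals, given by

$$\tilde{p}_{\mathrm{row}} = (1 - \zeta) \cdot p_{\mathrm{row}} + \zeta \cdot \mathbf{1}/n \quad \text{and} \quad \tilde{p}_{\mathrm{col}} = (1 - \zeta) \cdot p_{\mathrm{col}} + \zeta \cdot \mathbf{1}/m , \qquad (3)$$

for some choice of smoothing parameter $\zeta$, which may be selected with cross-validation.¹ The
smoothed empirically-weighted trace norm is also studied in [6], where $p_{i\bullet}$ is replaced with
$\hat{p}_{i\bullet} = \frac{\#\text{ observations in row } i}{\text{total } \#\text{ observations}}$, the empirical marginal probability of row $i$ (and likewise for $\hat{p}_{\bullet j}$). Using
empirical rather than "true" weights yielded lower error in experiments in [6], even when the true
sampling distribution was uniform.

More generally, for any weight vectors $r \in \Delta_{[n]}$ and $c \in \Delta_{[m]}$ and a matrix $X \in \mathbb{R}^{n \times m}$, the
$(r, c)$-weighted trace norm is given by

$$\|X\|_{\mathrm{tr}(r,c)} = \left\| \mathrm{diag}(r)^{1/2} \cdot X \cdot \mathrm{diag}(c)^{1/2} \right\|_{\mathrm{tr}} .$$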



¹ Our $\zeta$ parameter here is equivalent to $1 - \alpha$ in [6].

Of course, we can easily obtain the existing methods of the uniform trace norm, (empirically)
weighted trace norm, and smoothed (empirically) weighted trace norm as special cases of this formulation.
Furthermore, the max norm is equal to a supremum over all possible weightings [7]:

$$\|X\|_{\max} = \sup_{r \in \Delta_{[n]},\, c \in \Delta_{[m]}} \|X\|_{\mathrm{tr}(r,c)} .$$
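The $(r, c)$-weighted trace norm is straightforward to evaluate numerically. The sketch below is only an illustration: the helper names and the random test matrix are assumptions, and the random search over weightings gives just a crude lower bound on the supremum that defines the max norm, not its exact value.

```python
import numpy as np

def trace_norm(X):
    """Sum of the singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

def weighted_trace_norm(X, r, c):
    """|| diag(r)^{1/2} X diag(c)^{1/2} ||_tr for weight vectors r and c."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    return trace_norm(np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :])

rng = np.random.default_rng(0)
n, m = 6, 4
X = rng.standard_normal((n, m))  # arbitrary test matrix (assumption)

# Uniform weights recover the trace norm rescaled by 1/sqrt(nm).
uniform = weighted_trace_norm(X, np.full(n, 1 / n), np.full(m, 1 / m))
rescaled = trace_norm(X) / np.sqrt(n * m)

# Crude lower bound on the max norm: best of a few random weightings.
samples = [
    weighted_trace_norm(X, rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(m)))
    for _ in range(200)
]
max_norm_lower_bound = max(samples + [uniform])
```

An exact evaluation of the max norm would instead require solving an optimization problem over the weights or over factorizations.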



2 The local max norm 



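We consider a generalization of these norms, which lies "in between" the trace norm and max norm.
For any $\mathcal{R} \subseteq \Delta_{[n]}$ and $\mathcal{C} \subseteq \Delta_{[m]}$, we define the $(\mathcal{R}, \mathcal{C})$-norm of $X$:

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \sup_{r \in \mathcal{R},\, c \in \mathcal{C}} \|X\|_{\mathrm{tr}(r,c)} .$$

This gives a norm on matrices, except in the trivial case where, for some $i$ or some $j$, $r_i = 0$ for all
$r \in \mathcal{R}$ or $c_j = 0$ for all $c \in \mathcal{C}$.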

We now show some existing and novel norms that can be obtained using local max norms. 



2.1 Trace norm and max norm 



We can obtain the max norm by taking the largest possible $\mathcal{R}$ and $\mathcal{C}$, i.e. $\|X\|_{\max} = \|X\|_{(\Delta_{[n]}, \Delta_{[m]})}$,
and similarly we can obtain the $(r, c)$-weighted trace norm by taking the singleton sets $\mathcal{R} = \{r\}$
and $\mathcal{C} = \{c\}$. As discussed above, this includes the standard trace norm (when $r$ and $c$ are uniform),
as well as the weighted, empirically weighted, and smoothed weighted trace norm.



2.2 Arbitrary smoothing 

When using the smoothed weighted trace norm, we need to choose the amount of smoothing to
apply to the marginals, that is, we need to choose $\zeta$ in our definition of the smoothed row and
column weights, as given in (3). Alternatively, we could regularize simultaneously over all possible
amounts of smoothing by considering the local max norm with

$$\mathcal{R} = \{ (1 - \zeta) \cdot p_{\mathrm{row}} + \zeta \cdot \mathbf{1}/n : \text{any } \zeta \in [0, 1] \} ,$$

and same for $\mathcal{C}$. That is, $\mathcal{R}$ and $\mathcal{C}$ are line segments in the simplex: they are larger than any single
point as for the uniform or weighted trace norm (or smoothed weighted trace norm for a fixed amount
of smoothing), but smaller than the entire simplex as for the max norm.
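To make the line-segment construction concrete, here is a minimal sketch that approximates this local max norm by a grid search over the row and column smoothing amounts. The grid resolution, the helper names and the randomly drawn marginals are assumptions made for illustration; a grid only approximates the supremum from below.

```python
import numpy as np

def weighted_trace_norm(X, r, c):
    """|| diag(r)^{1/2} X diag(c)^{1/2} ||_tr."""
    M = np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :]
    return float(np.linalg.svd(M, compute_uv=False).sum())

def smooth(p, zeta):
    """Smoothed marginals (1 - zeta) * p + zeta * uniform."""
    return (1.0 - zeta) * p + zeta / len(p)

def segment_local_max_norm(X, p_row, p_col, grid_size=21):
    """Grid approximation of the sup over both smoothing line segments."""
    zetas = np.linspace(0.0, 1.0, grid_size)
    return max(
        weighted_trace_norm(X, smooth(p_row, zr), smooth(p_col, zc))
        for zr in zetas
        for zc in zetas
    )

rng = np.random.default_rng(2)
n, m = 8, 5
X = rng.standard_normal((n, m))       # arbitrary test matrix (assumption)
p_row = rng.dirichlet(np.ones(n))     # hypothetical row marginals
p_col = rng.dirichlet(np.ones(m))     # hypothetical column marginals

weighted = weighted_trace_norm(X, p_row, p_col)                           # zeta = 0
uniform = weighted_trace_norm(X, smooth(p_row, 1.0), smooth(p_col, 1.0))  # zeta = 1
segment = segment_local_max_norm(X, p_row, p_col)
```

Because the grid contains both endpoints, `segment` is at least as large as both the weighted and the uniform trace norm values.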



2.3 Connection to $(\beta, \tau)$-decomposability

Hazan et al. [8] introduce a class of matrices defined by a property of $(\beta, \tau)$-decomposability: a
matrix $X$ satisfies this property if there exists a factorization $X = AB^\top$ (where $A$ and $B$ may have
an arbitrary number of columns) such that

$$\max\left\{ \max_i \|A_{(i)}\|_2^2 , \max_j \|B_{(j)}\|_2^2 \right\} \leq 2\beta, \qquad \sum_i \|A_{(i)}\|_2^2 + \sum_j \|B_{(j)}\|_2^2 \leq \tau ,$$

where $A_{(i)}$ and $B_{(j)}$ are the $i$th row of $A$ and the $j$th row of $B$, respectively.²

Comparing with (1) and (2), we see that the $\beta$ and $\tau$ parameters essentially correspond to the max
norm and trace norm, with the max norm being the minimal $2\beta^*$ such that the matrix is $(\beta^*, \tau)$-decomposable
for some $\tau$, and the trace norm being the minimal $\tau^*/2$ such that the matrix is
$(\beta, \tau^*)$-decomposable for some $\beta$. However, Hazan et al. go beyond these two extremes, and rely
on balancing both $\beta$ and $\tau$: they establish learning guarantees (in an adversarial online model, and
thus also under an arbitrary sampling distribution $p$) which scale with $\sqrt{\beta \cdot \tau}$. It may therefore be
useful to consider a penalty function of the form:



$$\mathrm{Penalty}_{(\beta,\tau)}(X) = \min_{AB^\top = X} \left\{ \sqrt{\max_i \|A_{(i)}\|_2^2 + \max_j \|B_{(j)}\|_2^2} \cdot \sqrt{\sum_i \|A_{(i)}\|_2^2 + \sum_j \|B_{(j)}\|_2^2} \right\} . \qquad (4)$$

(Note that $\max\{ \max_i \|A_{(i)}\|_2^2 , \max_j \|B_{(j)}\|_2^2 \}$ is replaced with $\max_i \|A_{(i)}\|_2^2 + \max_j \|B_{(j)}\|_2^2$
for later convenience. This affects the value of the penalty function by at most a factor of $\sqrt{2}$.)

² Hazan et al. state the property differently, but equivalently, in terms of a semidefinite matrix decomposition.

This penalty function does not appear to be convex in $X$. However, the proposition below (proved in
the Supplementary Materials) shows that we can use a (convex) local max norm penalty to compute
a solution to any objective function with a penalty function of the form (4):

Proposition 1. Let $\hat{X}$ be the minimizer of a penalized loss function with this modified penalty,

$$\hat{X} := \arg\min_X \left\{ \mathrm{Loss}(X) + \lambda \cdot \mathrm{Penalty}_{(\beta,\tau)}(X) \right\} ,$$

where $\lambda \geq 0$ is some penalty parameter and $\mathrm{Loss}(\cdot)$ is any convex function. Then, for some penalty
parameter $\mu \geq 0$ and some $t \in [0, 1]$,

$$\hat{X} = \arg\min_X \left\{ \mathrm{Loss}(X) + \mu \cdot \|X\|_{(\mathcal{R},\mathcal{C})} \right\} , \text{ where}$$

$$\mathcal{R} = \left\{ r \in \Delta_{[n]} : r_i \geq \frac{t}{1 + (n-1)t} \ \forall i \right\} \quad \text{and} \quad \mathcal{C} = \left\{ c \in \Delta_{[m]} : c_j \geq \frac{t}{1 + (m-1)t} \ \forall j \right\} .$$

We note that $\mu$ and $t$ cannot be determined based on $\lambda$ alone; they will depend on the properties of
the unknown solution $\hat{X}$.

Here the sets $\mathcal{R}$ and $\mathcal{C}$ impose a lower bound on each of the weights, and this lower bound can be
used to interpolate between the max and trace norms: when $t = 1$, each $r_i$ is lower bounded by
$1/n$ (and similarly for $c_j$), i.e. $\mathcal{R}$ and $\mathcal{C}$ are singletons containing only the uniform weights and we
obtain the trace norm. On the other hand, when $t = 0$, the weights are lower-bounded by zero, and
so any weight vector is allowed, i.e. $\mathcal{R}$ and $\mathcal{C}$ are each the entire simplex and we obtain the max
norm. Intermediate values of $t$ interpolate between the trace norm and max norm and correspond to
different balances between $\beta$ and $\tau$.



2.4 Interpolating between trace norm and max norm 

We next turn to an interpolation which relies on an upper bound, rather than a lower bound, on the
weights. Consider

$$\mathcal{R}_\epsilon = \{ r \in \Delta_{[n]} : r_i \leq \epsilon \ \forall i \} \quad \text{and} \quad \mathcal{C}_\delta = \{ c \in \Delta_{[m]} : c_j \leq \delta \ \forall j \} , \qquad (5)$$

for some $\epsilon \in [1/n, 1]$ and $\delta \in [1/m, 1]$. The $(\mathcal{R}_\epsilon, \mathcal{C}_\delta)$-norm is then equal to the (rescaled) trace norm
when we choose $\epsilon = 1/n$ and $\delta = 1/m$, and is equal to the max norm when we choose $\epsilon = \delta = 1$.
Allowing $\epsilon$ and $\delta$ to take intermediate values gives a smooth interpolation between these two familiar
norms, and may be useful in situations where we want more flexibility in the type of regularization.

We can generalize this to an interpolation between the max norm and a smoothed weighted trace
norm, which we will use in our experimental results. We consider two generalizations; for each
one, we state a definition of $\mathcal{R}$, with $\mathcal{C}$ defined analogously. The first is multiplicative:

$$\mathcal{R}^{\times}_{\zeta,\gamma} := \{ r \in \Delta_{[n]} : r_i \leq \gamma \cdot ((1 - \zeta) \cdot p_{i\bullet} + \zeta \cdot 1/n) \ \forall i \} , \qquad (6)$$

where $\gamma = 1$ corresponds to choosing the singleton set $\mathcal{R}^{\times}_{\zeta,\gamma} = \{ (1 - \zeta) \cdot p_{\mathrm{row}} + \zeta \cdot \mathbf{1}/n \}$ (i.e. the
smoothed weighted trace norm), while $\gamma = \infty$ corresponds to the max norm (for any choice of $\zeta$),
since we would get $\mathcal{R}^{\times}_{\zeta,\gamma} = \Delta_{[n]}$.

The second option for an interpolation is instead defined with an exponent:

$$\mathcal{R}_{\zeta,\tau} := \{ r \in \Delta_{[n]} : r_i \leq ((1 - \zeta) \cdot p_{i\bullet} + \zeta \cdot 1/n)^{1 - \tau} \ \forall i \} . \qquad (7)$$

Here $\tau = 0$ will yield the singleton set corresponding to the smoothed weighted trace norm, while
$\tau = 1$ will yield $\mathcal{R}_{\zeta,\tau} = \Delta_{[n]}$, i.e. the max norm, for any choice of $\zeta$.
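The two interpolation schemes differ only in how the element-wise upper bounds on the weights are computed. The following sketch (function names and the example marginals are assumptions) computes both sets of bounds and checks the endpoint behaviour described above. Capping the multiplicative bound at 1 is a convenience added here; it does not change the set, since no weight in the simplex exceeds 1.

```python
import numpy as np

def smoothed(p, zeta):
    """Smoothed marginals (1 - zeta) * p + zeta / n."""
    p = np.asarray(p, dtype=float)
    return (1.0 - zeta) * p + zeta / len(p)

def multiplicative_bounds(p, zeta, gamma):
    """Upper bounds r_i <= gamma * smoothed marginal (capped at 1)."""
    return np.minimum(1.0, gamma * smoothed(p, zeta))

def exponent_bounds(p, zeta, tau):
    """Upper bounds r_i <= (smoothed marginal) ** (1 - tau)."""
    return smoothed(p, zeta) ** (1.0 - tau)

p_row = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical row marginals
zeta = 0.2
base = smoothed(p_row, zeta)

trace_end_mult = multiplicative_bounds(p_row, zeta, 1.0)   # singleton set
max_end_mult = multiplicative_bounds(p_row, zeta, np.inf)  # whole simplex
trace_end_exp = exponent_bounds(p_row, zeta, 0.0)          # singleton set
max_end_exp = exponent_bounds(p_row, zeta, 1.0)            # whole simplex
middle_exp = exponent_bounds(p_row, zeta, 0.5)
```

When the bounds sum to exactly 1, the only feasible weight vector is the vector of bounds itself, which is why the trace-norm endpoints give singleton sets.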

We find the second (exponent) option to be more natural, because each of the row marginal bounds
will reach 1 simultaneously when $\tau = 1$, and hence we use this version in our experiments. On
the other hand, the multiplicative version is easier to work with theoretically, and we use this in our
learning guarantee in Section 4.2. If all of the row and column marginals satisfy some loose upper
bound, then the two options will not be highly different.

3 Optimization with the local max norm



One appeal of both the trace norm and the max norm is that they are both SDP representable [9, 10],
and thus easily optimizable, at least in small scale problems. Indeed, in the Supplementary Materials
we show that the local max norm is also SDP representable, as long as the sets $\mathcal{R}$ and $\mathcal{C}$ can be written
in terms of linear or semi-definite constraints. This includes all the examples we mention, where in
all of them the sets $\mathcal{R}$ and $\mathcal{C}$ are specified in terms of simple linear constraints.

However, for large scale problems, it is not practical to directly use SDP optimization approaches.
Instead, and especially for very large scale problems, an effective optimization approach for both
the trace norm and the max norm is to use the factorized versions of the norms, given in (1) and (2),
and to optimize the factorization directly (typically, only factorizations of some truncated dimensionality
are used) [11, 12, 7]. As we show in Theorem 1 below, a similar factorization-optimization
approach is also possible for any local max norm with convex $\mathcal{R}$ and $\mathcal{C}$. We further give a simplified
representation which is applicable when $\mathcal{R}$ and $\mathcal{C}$ are specified through element-wise upper bounds
$R \in \mathbb{R}^n_+$ and $C \in \mathbb{R}^m_+$, respectively:

$$\mathcal{R} = \{ r \in \Delta_{[n]} : r_i \leq R_i \ \forall i \} \quad \text{and} \quad \mathcal{C} = \{ c \in \Delta_{[m]} : c_j \leq C_j \ \forall j \} , \qquad (8)$$

with $0 \leq R_i \leq 1$, $\sum_i R_i \geq 1$, $0 \leq C_j \leq 1$, $\sum_j C_j \geq 1$ to avoid triviality. This includes the
interpolation norms of Section 2.4.



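Theorem 1. If $\mathcal{R}$ and $\mathcal{C}$ are convex, then the $(\mathcal{R}, \mathcal{C})$-norm can be calculated with the factorization

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \sum_i r_i \|A_{(i)}\|_2^2 + \sup_{c \in \mathcal{C}} \sum_j c_j \|B_{(j)}\|_2^2 \right) . \qquad (9)$$

In the special case when $\mathcal{R}$ and $\mathcal{C}$ are defined by (8), writing $(x)_+ = \max\{0, x\}$, this simplifies to

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf_{AB^\top = X;\ a, b \in \mathbb{R}} \left\{ a + \sum_i R_i \left( \|A_{(i)}\|_2^2 - a \right)_+ + b + \sum_j C_j \left( \|B_{(j)}\|_2^2 - b \right)_+ \right\} .$$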
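The special case of Theorem 1 rests on the identity $\sup_{r \in \mathcal{R}} \sum_i r_i \|A_{(i)}\|_2^2 = \inf_a \{ a + \sum_i R_i (\|A_{(i)}\|_2^2 - a)_+ \}$ for sets of the form (8). The sketch below checks this numerically for one factor. It is an illustration under assumptions: the function names are hypothetical, the supremum is computed by a greedy allocation (valid for a linear objective over a box-constrained simplex), and the infimum is searched only over the breakpoints $a = \|A_{(i)}\|_2^2$, where a convex piecewise-linear function attains its minimum.

```python
import numpy as np

def sup_over_box_simplex(values, upper):
    """Maximize sum_i r_i * values_i over r in the simplex with r_i <= upper_i."""
    order = np.argsort(-values)          # largest values first
    remaining, total = 1.0, 0.0
    for i in order:
        w = min(upper[i], remaining)     # put as much mass as allowed here
        total += w * values[i]
        remaining -= w
        if remaining <= 0.0:
            break
    return total

def dual_value(values, upper):
    """Minimize a + sum_i upper_i * (values_i - a)_+ over breakpoints a."""
    return min(a + np.sum(upper * np.maximum(values - a, 0.0)) for a in values)

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))          # arbitrary factor (assumption)
sq_row_norms = (A ** 2).sum(axis=1)
R = np.full(6, 0.3)                      # upper bounds, sum = 1.8 >= 1

primal = sup_over_box_simplex(sq_row_norms, R)
dual = dual_value(sq_row_norms, R)
```

The two values should agree up to rounding. In a factorized optimization, the dual form is convenient because it replaces the inner supremum with one extra scalar variable per factor.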



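Proof sketch for Theorem 1. For convenience we will write $r^{1/2}$ to mean $\mathrm{diag}(r)^{1/2}$, and same for $c$.
Using the trace norm factorization identity (1), we have

$$2 \|X\|_{(\mathcal{R},\mathcal{C})} = 2 \sup_{r \in \mathcal{R},\, c \in \mathcal{C}} \left\| r^{1/2} \cdot X \cdot c^{1/2} \right\|_{\mathrm{tr}} = \sup_{r \in \mathcal{R},\, c \in \mathcal{C}} \ \inf_{CD^\top = r^{1/2} \cdot X \cdot c^{1/2}} \left( \|C\|_F^2 + \|D\|_F^2 \right)$$

$$= \sup_{r \in \mathcal{R},\, c \in \mathcal{C}} \ \inf_{AB^\top = X} \left( \left\| r^{1/2} A \right\|_F^2 + \left\| c^{1/2} B \right\|_F^2 \right) \leq \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B \right\|_F^2 \right) ,$$

where for the next-to-last step we set $C = r^{1/2} A$ and $D = c^{1/2} B$, and the last step follows because
$\sup \inf \leq \inf \sup$ always (weak duality). The reverse inequality holds as well (strong duality), and
is proved in the Supplementary Materials, where we also prove the special-case result. □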



4 An approximate convex hull and a learning guarantee 

In this section, we look for theoretical bounds on error for the problem of estimating unobserved
entries in a matrix $Y$ that is approximately low-rank. Our results apply for either uniform or non-uniform
sampling of entries from the matrix. We begin with a result comparing the $(\mathcal{R}, \mathcal{C})$-norm unit
ball to a convex hull of rank-1 matrices, which will be useful for proving our learning guarantee.

4.1 Convex hull 

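To gain a better theoretical understanding of the $(\mathcal{R}, \mathcal{C})$-norm, we first need to define corresponding
vector norms on $\mathbb{R}^n$ and $\mathbb{R}^m$. For any $u \in \mathbb{R}^n$, let

$$\|u\|_{\mathcal{R}} := \sqrt{ \sup_{r \in \mathcal{R}} \sum_i r_i u_i^2 } = \sup_{r \in \mathcal{R}} \left\| \mathrm{diag}(r)^{1/2} \cdot u \right\|_2 .$$

We can think of this norm as a way to interpolate between the $\ell_2$ and $\ell_\infty$ vector norms. For example,
if we choose $\mathcal{R} = \mathcal{R}_\epsilon$ as defined in (5), then $\|u\|_{\mathcal{R}}$ is equal to the root-mean-square of the $\epsilon^{-1}$
largest entries of $u$ whenever $\epsilon^{-1}$ is an integer. Defining $\|v\|_{\mathcal{C}}$ analogously for $v \in \mathbb{R}^m$, we can now
relate these vector norms to the $(\mathcal{R}, \mathcal{C})$-norm on matrices.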
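The root-mean-square interpretation of this vector norm can be checked directly. The sketch below is illustrative only: the function name and example vector are assumptions, and the supremum over $\mathcal{R}_\epsilon$ is computed greedily by placing weight $\epsilon$ on the largest squared entries, which requires $\epsilon \geq 1/n$ so that the set is nonempty.

```python
import math

def local_vector_norm(u, eps):
    """Vector norm for R_eps = {r in simplex : r_i <= eps}; needs eps >= 1/len(u)."""
    squares = sorted((x * x for x in u), reverse=True)
    remaining, total = 1.0, 0.0
    for s in squares:
        w = min(eps, remaining)   # weight eps on each of the largest entries
        total += w * s
        remaining -= w
        if remaining <= 0.0:
            break
    return math.sqrt(total)

u = [3.0, -4.0, 1.0, 0.0]
norm_half = local_vector_norm(u, 0.5)      # RMS of the 2 largest entries (in magnitude)
norm_quarter = local_vector_norm(u, 0.25)  # RMS of all 4 entries
norm_one = local_vector_norm(u, 1.0)       # largest absolute entry
```

With $\epsilon = 1/n$ the norm is the $\ell_2$ norm scaled by $1/\sqrt{n}$, and with $\epsilon = 1$ it is the $\ell_\infty$ norm, matching the two ends of the interpolation.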
Theorem 2. For any convex $\mathcal{R} \subseteq \Delta_{[n]}$ and $\mathcal{C} \subseteq \Delta_{[m]}$, the $(\mathcal{R}, \mathcal{C})$-norm unit ball is bounded above
and below by a convex hull as:

$$\mathrm{Conv}\left\{ uv^\top : \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\} \subseteq \left\{ X : \|X\|_{(\mathcal{R},\mathcal{C})} \leq 1 \right\} \subseteq K_G \cdot \mathrm{Conv}\left\{ uv^\top : \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\} ,$$

where $K_G \leq 1.79$ is Grothendieck's constant, and implicitly $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$.

This result is a nontrivial extension of Srebro and Shraibman [1]'s analysis for the max norm and
the trace norm. They show that the statement holds for the max norm, i.e. when $\mathcal{R} = \Delta_{[n]}$ and
$\mathcal{C} = \Delta_{[m]}$, and that the trace norm unit ball is exactly equal to the corresponding convex hull (see
Corollary 2 and Section 3.2 in their paper, respectively).

Proof sketch for Theorem 2. To prove the first inclusion, given any $X = uv^\top$ with $\|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1$,
we apply the factorization result Theorem 1 to see that $\|X\|_{(\mathcal{R},\mathcal{C})} \leq 1$. Since the $(\mathcal{R}, \mathcal{C})$-norm
unit ball is convex, this is sufficient. For the second inclusion, we state a weighted version of
Grothendieck's Inequality (proof in the Supplementary Materials):

$$\sup\left\{ \langle Y, UV^\top \rangle : U \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{m \times k}, \|U_{(i)}\|_2 \leq a_i \ \forall i, \ \|V_{(j)}\|_2 \leq b_j \ \forall j \right\}$$

$$\leq K_G \cdot \sup\left\{ \langle Y, uv^\top \rangle : u \in \mathbb{R}^n, v \in \mathbb{R}^m, |u_i| \leq a_i \ \forall i, \ |v_j| \leq b_j \ \forall j \right\} .$$

We then apply this weighted inequality to the dual norm to the $(\mathcal{R}, \mathcal{C})$-norm to prove the desired
inclusion, as in Srebro and Shraibman [1]'s work for the max norm case (see Corollary 2 in their
paper). Details are given in the Supplementary Materials. □

4.2 Learning guarantee 

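We now give our main matrix reconstruction result, which provides error bounds for a family of
norms interpolating between the max norm and the smoothed weighted trace norm.

Theorem 3. Let $p$ be any distribution on $[n] \times [m]$. Suppose that, for some $\gamma \geq 1$, $\mathcal{R} \supseteq \mathcal{R}^{\times}_{1/2,\gamma}$
and $\mathcal{C} \supseteq \mathcal{C}^{\times}_{1/2,\gamma}$, where these two sets are defined in (6). Let $S = \{ (i_t, j_t) : t = 1, \dots, s \}$ be
a random sample of locations in the matrix drawn i.i.d. from $p$, where $s \geq n$. Then, in expectation
over the sample $S$,

$$\sum_{ij} p_{ij} \left| Y_{ij} - \hat{X}_{ij} \right| \leq \underbrace{ \inf_{\|X\|_{(\mathcal{R},\mathcal{C})} \leq \sqrt{k}} \sum_{ij} p_{ij} \left| Y_{ij} - X_{ij} \right| }_{\text{Approximation error}} + \underbrace{ O\left( \sqrt{\frac{kn}{s}} \right) \cdot \left( 1 + \frac{\log(n)}{\sqrt{\gamma}} \right) }_{\text{Excess error}} ,$$

where $\hat{X} = \arg\min_{\|X\|_{(\mathcal{R},\mathcal{C})} \leq \sqrt{k}} \sum_{t=1}^{s} \left| Y_{i_t j_t} - X_{i_t j_t} \right|$. Additionally, if we assume that $s \geq n \log(n)$,
then in the excess risk bound, we can reduce the term $\log(n)$ to $\sqrt{\log(n)}$.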

Proof sketch for Theorem 3. The main idea is to use the convex hull formulation from Theorem 2
to show that, for any $X$ with $\|X\|_{(\mathcal{R},\mathcal{C})} \leq \sqrt{k}$, there exists a decomposition $X = X' + X''$ with
$\|X'\|_{\max} \leq O(\sqrt{k})$ and $\|X''\|_{\mathrm{tr}(\tilde{p})} \leq O(\sqrt{k/\gamma})$, where $\tilde{p}$ represents the smoothed marginals with
smoothing parameter $\zeta = 1/2$ as in (3). We then apply known bounds on the Rademacher complexity
of the max norm unit ball [1] and the smoothed weighted trace norm unit ball [6], to bound the
Rademacher complexity of $\{ X : \|X\|_{(\mathcal{R},\mathcal{C})} \leq \sqrt{k} \}$. This then yields a learning guarantee by
Theorem 8 of Bartlett and Mendelson [13]. Details are given in the Supplementary Materials. □

As special cases of this theorem, we can re-derive the existing results for the max norm and smoothed
weighted trace norm. Specifically, choosing $\gamma = \infty$ gives us an excess error term of order $\sqrt{kn/s}$
for the max norm, previously shown by [1], while setting $\gamma = 1$ yields an excess error term of order
$\sqrt{kn \log(n)/s}$ for the smoothed weighted trace norm as long as $s \geq n \log(n)$, as shown in [6].

What advantage does this new result offer over the existing results for the max norm and for the
smoothed weighted trace norm? To simplify the comparison, suppose we choose $\gamma = \log^2(n)$, and
define $\mathcal{R} = \mathcal{R}^{\times}_{1/2,\gamma}$ and $\mathcal{C} = \mathcal{C}^{\times}_{1/2,\gamma}$. Then, comparing to the max norm result (when $\gamma = \infty$), we see
that the excess error term is the same in both cases (up to a constant), but the approximation error
term may in general be much lower for the local max norm than for the max norm. Comparing next
to the weighted trace norm (when $\gamma = 1$), we see that the excess error term is lower by a factor of
$\log(n)$ for the local max norm. This may come at a cost of increasing the approximation error, but in
general this increase will be very small. In particular, the local max norm result allows us to give a
meaningful guarantee for a sample size $s = \Theta(kn)$, rather than requiring $s \geq \Theta(kn \log(n))$ as for
any trace norm result, but with a hypothesis class significantly richer than the max norm constrained
class (though not as rich as the trace norm constrained class).

Table 1: Matrix fitting for the five methods used in experiments.

Norm                                      | Fixed parameters               | Free parameters
Max norm                                  | $\zeta$ arbitrary; $\tau = 1$  | $\lambda$
(Uniform) trace norm                      | $\zeta = 1$; $\tau = 0$        | $\lambda$
Empirically-weighted trace norm           | $\zeta = 0$; $\tau = 0$        | $\lambda$
Arbitrarily-smoothed emp.-wtd. trace norm | $\tau = 0$                     | $\zeta$; $\lambda$
Local max norm                            | (none)                         | $\zeta$; $\tau$; $\lambda$

5 Experiments 

We test the local max norm on simulated and real matrix reconstruction tasks, and compare its 
performance to the max norm, the uniform and empirically-weighted trace norms, and the smoothed 
empirically-weighted trace norm. 

5.1 Simulations 

We simulate $n \times n$ noisy matrices for $n = 30, 60, 120, 240$, where the underlying signal has rank
$k = 2$ or $k = 4$, and we observe $s = 3kn$ entries (chosen uniformly without replacement). We
performed 50 trials for each of the 8 combinations of $(n, k)$.

Data. For each trial, we randomly draw a matrix $U \in \mathbb{R}^{n \times k}$ by drawing each row uniformly at
random from the unit sphere in $\mathbb{R}^k$. We generate $V \in \mathbb{R}^{n \times k}$ similarly. We set $Y = UV^\top + \sigma \cdot Z$,
where the noise matrix $Z$ has i.i.d. standard normal entries and $\sigma = 0.3$ is a moderate noise level.
We also divide the entries of the matrix into sets $S_0 \cup S_1 \cup S_2$ which consist of $s = 3kn$ training
entries, $s$ validation entries, and $n^2 - 2s$ test entries, respectively, chosen uniformly at random.
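The data-generating process described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the seed handling and the use of flat indices for the three entry sets are assumptions. Rows are drawn uniformly from the unit sphere by normalizing standard Gaussian vectors.

```python
import numpy as np

def simulate(n, k, sigma=0.3, seed=0):
    """Draw Y = U V^T + sigma * Z and split the n*n entries into three sets."""
    rng = np.random.default_rng(seed)

    def unit_rows(shape):
        # Normalized Gaussian rows are uniform on the unit sphere.
        G = rng.standard_normal(shape)
        return G / np.linalg.norm(G, axis=1, keepdims=True)

    U, V = unit_rows((n, k)), unit_rows((n, k))
    Y = U @ V.T + sigma * rng.standard_normal((n, n))

    s = 3 * k * n                      # number of training entries
    perm = rng.permutation(n * n)      # flat indices into Y, random order
    train, val, test = perm[:s], perm[s:2 * s], perm[2 * s:]
    return Y, U, V, train, val, test

Y, U, V, train, val, test = simulate(n=30, k=2)
```

Flat indices can be converted to (row, column) pairs with `np.unravel_index(train, Y.shape)` when forming the empirical marginals or the squared-error loss.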

Methods. We use the two-parameter family of norms defined in (7), but replacing the true
marginals $p_{i\bullet}$ and $p_{\bullet j}$ with the empirical marginals $\hat{p}_{i\bullet}$ and $\hat{p}_{\bullet j}$. We consider
$\zeta, \tau \in \{0, 0.1, \dots, 0.9, 1\}$. For each $(\zeta, \tau)$ combination and each penalty parameter value
$\lambda \in \{2^1, 2^2, \dots, 2^{10}\}$, we compute the fitted matrix

$$\hat{X} = \arg\min_X \left\{ \sum_{(i,j) \in S_0} (Y_{ij} - X_{ij})^2 + \lambda \cdot \|X\|_{(\mathcal{R}_{\zeta,\tau}, \mathcal{C}_{\zeta,\tau})} \right\} . \qquad (10)$$

(In fact, we use a rank-8 approximation to this optimization problem, as described in Section 3.)
For each of the considered matrix norm methods, we use the validation set $S_1$ to select the best
combination of $\zeta$, $\tau$, and $\lambda$, with restrictions on $\zeta$ and/or $\tau$ as specified by the definition of the
method (see Table 1). We then report the error of the resulting fitted matrix on the test set $S_2$.

Results. The results for these simulations are displayed in Figure 1. We see that the local max
norm results in lower error than any of the tested existing norms, across all the settings used.

5.2 Movie ratings data 

We next compare several different matrix norms on two collaborative filtering movie ratings datasets,
the Netflix [14] and MovieLens [15] datasets. The sizes of the data sets, and the split of the ratings
into training, validation and test sets,³ are:

Dataset   | # users | # movies | Training set | Validation set | Test set
Netflix   | 480,189 | 17,770   | 100,380,507  | 100,000        | 1,408,395
MovieLens | 71,567  | 10,681   | 8,900,054    | 100,000        | 1,000,000



³ For Netflix, the test set we use is their "qualification set", designed for a more uniform distribution of
ratings across users relative to the training set. For MovieLens, we choose our test set at random from the
available data.

[Figure 1 plot: two panels, each with x-axis "Matrix dimension n" (30, 60, 120, 240); legend: Trace, Emp. trace, Smth. trace, Max, Local max.]

Figure 1: Simulation results for matrix reconstruction with a rank-2 (left) or rank-4 (right) signal, corrupted by
noise. The plot shows per-entry squared error averaged over 50 trials, with standard error bars. For the rank-4
experiment, max norm error exceeded 0.20 for each $n = 60, 120, 240$ and is not displayed in the plot.



Table 2: Root mean squared error (RMSE) results for estimating movie ratings on Netflix and MovieLens data
using a rank 30 model. Setting $\tau = 0$ corresponds to the uniform or weighted or smoothed weighted trace
norm (depending on $\zeta$), while $\tau = 1$ corresponds to the max norm for any $\zeta$ value.

MovieLens

$\zeta \backslash \tau$ | 0.00   | 0.05   | 0.10   | 1.00
0.00                    | 0.7852 | 0.7827 | 0.7838 | 0.7918
0.05                    | 0.7836 | 0.7822 | 0.7842 |
0.10                    | 0.7831 | 0.7837 | 0.7846 |
0.15                    | 0.7833 | 0.7842 | 0.7854 |
0.20                    | 0.7842 | 0.7853 | 0.7866 |
1.00                    | 0.7997 |        |        |

Netflix

$\zeta \backslash \tau$ | 0.00   | 0.05   | 0.10   | 1.00
0.00                    | 0.9107 | 0.9092 | 0.9094 | 0.9131
0.05                    | 0.9095 | 0.9090 | 0.9107 |
0.10                    | 0.9096 | 0.9098 | 0.9122 |
0.15                    | 0.9102 | 0.9111 | 0.9131 |
0.20                    | 0.9126 | 0.9344 | 0.9153 |
1.00                    | 0.9235 |        |        |


We test the local max norm given in (7) with $\zeta \in \{0, 0.05, 0.1, 0.15, 0.2\}$ and $\tau \in \{0, 0.05, 0.1\}$.
We also test $\tau = 1$ (the max norm; here $\zeta$ is arbitrary) and $\zeta = 1$, $\tau = 0$ (the uniform trace norm).
We follow the test protocol of [6], with a rank-30 approximation to the optimization problem (10).

Table 2 shows root mean squared error (RMSE) for the experiments. For both the MovieLens and
Netflix data, the local max norm with $\tau = 0.05$ and $\zeta = 0.05$ gives strictly better accuracy than any
previously-known norm studied in this setting. (In practice, we can use a validation set to reliably
select good values for the $\tau$ and $\zeta$ parameters.⁴) For the MovieLens data, the local max norm
achieves RMSE of 0.7822, compared to 0.7831 achieved by the smoothed empirically-weighted
trace norm with $\zeta = 0.10$, which gives the best result among the previously-known norms. For the
Netflix dataset the local max norm achieves RMSE of 0.9090, improving upon the previous best
result of 0.9096 achieved by the smoothed empirically-weighted trace norm [6].



6 Summary 



In this paper, we introduce a unifying family of matrix norms, called the "local max" norms, that 
generalizes existing methods for matrix reconstruction, such as the max norm and trace norm. We 
examine some interesting sub-families of local max norms, and consider several different options 
for interpolating between the trace (or smoothed weighted trace) and max norms. We find norms 
lying strictly between the trace norm and the max norm that give improved accuracy in matrix 
reconstruction for both simulated data and real movie ratings data. We show that regularizing with 
any local max norm is fairly simple to optimize, and give a theoretical result suggesting improved 
matrix reconstruction using new norms in this family. 



⁴ To check this, we subsample half of the test data at random, and use it as a validation set to choose $(\zeta, \tau)$
for each method (as specified in Table 1). We then evaluate error on the remaining half of the test data. For
MovieLens, the local max norm gives an RMSE of 0.7820 with selected parameter values $\zeta = \tau = 0.05$, as
compared to an RMSE of 0.7829 with selected smoothing parameter $\zeta = 0.10$ for the smoothed weighted trace
norm. For Netflix, the local max norm gives an RMSE of 0.9093 with $\zeta = \tau = 0.05$, while the smoothed
weighted trace norm gives an RMSE of 0.9098 with $\zeta = 0.05$. The other tested methods give higher error on
both datasets.

Supplementary Materials
A Proof of Theorem 1 

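Special case: element-wise upper bounds. First, we assume that the general result is true, i.e.

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B \right\|_F^2 \right) , \qquad (11)$$

and prove the result in the special case, where

$$\mathcal{R} = \{ r \in \Delta_{[n]} : r_i \leq R_i \ \forall i \} \quad \text{and} \quad \mathcal{C} = \{ c \in \Delta_{[m]} : c_j \leq C_j \ \forall j \} .$$

Using strong duality for linear programs, we have

$$\sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 = \sup_{r \in \mathbb{R}^n_+} \left\{ \sum_i r_i \|A_{(i)}\|_2^2 : r_i \leq R_i \ \forall i, \ \sum_i r_i = 1 \right\}$$

$$= \inf_{a \in \mathbb{R},\, a_1 \in \mathbb{R}^n_+} \left\{ a + \sum_i R_i a_{1i} : a + a_{1i} \geq \|A_{(i)}\|_2^2 \ \forall i \right\} .$$

In this last line, if we fix $a$ and want to minimize over $a_1 \in \mathbb{R}^n_+$, it is clear that the infimum is
obtained by setting $a_{1i} = \left( \|A_{(i)}\|_2^2 - a \right)_+$ for each $i$. This proves that

$$\sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 = \inf_{a \in \mathbb{R}} \left\{ a + \sum_i R_i \left( \|A_{(i)}\|_2^2 - a \right)_+ \right\} .$$

Applying the same reasoning to the columns and plugging everything in to (11), we get

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B \right\|_F^2 \right)$$

$$= \frac{1}{2} \inf_{AB^\top = X;\ a, b \in \mathbb{R}} \left\{ a + \sum_i R_i \left( \|A_{(i)}\|_2^2 - a \right)_+ + b + \sum_j C_j \left( \|B_{(j)}\|_2^2 - b \right)_+ \right\} .$$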

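General factorization result. In the proof sketch given in the main paper, we showed that

$$2 \|X\|_{(\mathcal{R},\mathcal{C})} \leq \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B \right\|_F^2 \right) .$$

We now want to prove the reverse inequality. Since $\|X\|_{(\mathcal{R},\mathcal{C})} = \|X\|_{(\bar{\mathcal{R}}, \bar{\mathcal{C}})}$ by definition (where
$\bar{S}$ denotes the closure of a set $S$), we can assume without loss of generality that $\mathcal{R}$ and $\mathcal{C}$ are both
closed (and compact) sets.

First, we restrict our attention to a special case (the "positive case"), where we assume that for all
$r \in \mathcal{R}$ and all $c \in \mathcal{C}$, $r_i > 0$ and $c_j > 0$ for all $i$ and $j$. (We will treat the general case below.)
Therefore, since $\|X\|_{\mathrm{tr}(r,c)}$ is continuous as a function of $(r, c)$ for any fixed $X$ and since $\mathcal{R}$ and $\mathcal{C}$
are closed, we must have some $r^* \in \mathcal{R}$ and $c^* \in \mathcal{C}$ such that $\|X\|_{(\mathcal{R},\mathcal{C})} = \|X\|_{\mathrm{tr}(r^*,c^*)}$, with $r^*_i > 0$
for all $i$ and $c^*_j > 0$ for all $j$.

Next, let $UDV^\top = r^{*1/2} \cdot X \cdot c^{*1/2}$ be a singular value decomposition, and let $A^* = r^{*-1/2} U D^{1/2}$
and $B^* = c^{*-1/2} V D^{1/2}$. Then $A^* B^{*\top} = X$, and

$$\left\| r^{*1/2} A^* \right\|_F^2 = \mathrm{trace}(UDU^\top) = \mathrm{trace}(D) = \|X\|_{\mathrm{tr}(r^*,c^*)} = \|X\|_{(\mathcal{R},\mathcal{C})} .$$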
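The algebraic identities in this construction hold for any strictly positive weight vectors, not only for the maximizers $r^*$ and $c^*$, so they can be sanity-checked on random weights. The sketch below does so; the random matrix and weights are assumptions for illustration, and it does not verify the optimality claim (12).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 4
X = rng.standard_normal((n, m))   # arbitrary test matrix (assumption)
r = rng.dirichlet(np.ones(n))     # strictly positive weights (stand-in for r*)
c = rng.dirichlet(np.ones(m))     # strictly positive weights (stand-in for c*)

# SVD of the weighted matrix r^{1/2} X c^{1/2} = U D V^T.
M = np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :]
U, d, Vt = np.linalg.svd(M, full_matrices=False)

# A* = r^{-1/2} U D^{1/2} and B* = c^{-1/2} V D^{1/2}.
A_star = (U * np.sqrt(d)) / np.sqrt(r)[:, None]
B_star = (Vt.T * np.sqrt(d)) / np.sqrt(c)[:, None]

weighted_norm = d.sum()                          # trace(D)
row_term = np.sum(r[:, None] * A_star ** 2)      # || r^{1/2} A* ||_F^2
col_term = np.sum(c[:, None] * B_star ** 2)      # || c^{1/2} B* ||_F^2
```

Both `row_term` and `col_term` should equal `weighted_norm`, and `A_star @ B_star.T` should reproduce `X`.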



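Below, we will show that

$$r^* = \arg\max_{r \in \mathcal{R}} \left\| r^{1/2} A^* \right\|_F^2 . \qquad (12)$$

This will imply that $\|X\|_{(\mathcal{R},\mathcal{C})} = \sup_{r \in \mathcal{R}} \left\| r^{1/2} A^* \right\|_F^2$, and following the same reasoning for $B^*$, we
will have proved

$$2 \|X\|_{(\mathcal{R},\mathcal{C})} = \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A^* \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B^* \right\|_F^2 \right) \geq \inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| c^{1/2} B \right\|_F^2 \right) ,$$

which is sufficient. It remains only to prove (12). Take any $r \in \mathcal{R}$ with $r \neq r^*$ and let $w = r - r^*$.
We have

$$\left\| r^{1/2} A^* \right\|_F^2 - \left\| r^{*1/2} A^* \right\|_F^2 = \sum_i w_i \|A^*_{(i)}\|_2^2 = \sum_i \frac{w_i}{r^*_i} \left( UDU^\top \right)_{ii} ,$$

and it will be sufficient to prove that this quantity is $\leq 0$. To do this, we first define, for any $t \in [0, 1]$,

$$f(t) := \mathrm{trace}\left( \left( \frac{r^* + tw}{r^*} \right)^{1/2} UDU^\top \right) = \sum_i \sqrt{1 + t \cdot \frac{w_i}{r^*_i}} \cdot \left( UDU^\top \right)_{ii} ,$$

where the ratio of vectors is taken element-wise and identified with a diagonal matrix. Using the fact
that $\mathrm{trace}(\cdot) \leq \| \cdot \|_{\mathrm{tr}}$ for all matrices, we have

$$f(t) \leq \left\| \left( \frac{r^* + tw}{r^*} \right)^{1/2} UDU^\top \right\|_{\mathrm{tr}} = \left\| (r^* + tw)^{1/2} \cdot X \cdot c^{*1/2} \right\|_{\mathrm{tr}} = \|X\|_{\mathrm{tr}(r^* + tw,\, c^*)} \leq \|X\|_{(\mathcal{R},\mathcal{C})} = f(0) ,$$

where the last inequality comes from the fact that $r^* + tw \in \mathcal{R}$ by convexity of $\mathcal{R}$. Therefore,

$$0 \geq \frac{d}{dt} f(t) \Big|_{t=0} = \frac{d}{dt} \left[ \sum_i \sqrt{1 + t \cdot \frac{w_i}{r^*_i}} \cdot \left( UDU^\top \right)_{ii} \right]_{t=0} = \frac{1}{2} \sum_i \frac{w_i}{r^*_i} \left( UDU^\top \right)_{ii} ,$$

as desired. (Here we take the right-sided derivative, i.e. taking a limit as $t$ approaches zero from the
right, since $f(t)$ is only defined for $t \in [0, 1]$.) This concludes the proof for the positive case.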

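Next, we prove that the general factorization (11) holds in the general case, where we might have
$\mathcal{R} \not\subset \mathbb{R}^n_{++}$ and/or $\mathcal{C} \not\subset \mathbb{R}^m_{++}$. If for any $i \in [n]$ we have $r_i = 0$ for all $r \in \mathcal{R}$, we can discard this
row of $X$, and same for any $j \in [m]$. Therefore, without loss of generality, for all $i \in [n]$ there
is some $r^{(i)} \in \mathcal{R}$ with $r^{(i)}_i > 0$. Taking a convex combination, $r^+ = \frac{1}{n} \sum_i r^{(i)} \in \mathcal{R}$, we have
$r^+ \in \mathcal{R} \cap \mathbb{R}^n_{++}$. Similarly, we can construct $c^+ \in \mathcal{C} \cap \mathbb{R}^m_{++}$.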

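Fix any $\epsilon > 0$, and let $\delta = \min\{ \min_i r^+_i , \min_j c^+_j \} \cdot \frac{\epsilon}{2(1+\epsilon)} > 0$, and define closed subsets

$$\mathcal{R}_0 = \left\{ r \in \mathcal{R} : \min_i r_i \geq \delta \right\} \subseteq \mathcal{R} \quad \text{and} \quad \mathcal{C}_0 = \left\{ c \in \mathcal{C} : \min_j c_j \geq \delta \right\} \subseteq \mathcal{C} .$$

Since we know that the factorization result holds for the "positive case", we have

$$\inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}_0} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}_0} \left\| c^{1/2} B \right\|_F^2 \right) = 2 \sup_{r \in \mathcal{R}_0,\, c \in \mathcal{C}_0} \|X\|_{\mathrm{tr}(r,c)} = 2 \|X\|_{(\mathcal{R}_0, \mathcal{C}_0)} .$$

Now choose any factorization $AB^\top = X$ such that

$$\left( \sup_{r \in \mathcal{R}_0} \left\| r^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}_0} \left\| c^{1/2} B \right\|_F^2 \right) \leq 2 \sup_{r \in \mathcal{R}_0,\, c \in \mathcal{C}_0} \|X\|_{\mathrm{tr}(r,c)} \cdot \left( 1 + \frac{\epsilon}{2} \right) \leq 2 \sup_{r \in \mathcal{R},\, c \in \mathcal{C}} \|X\|_{\mathrm{tr}(r,c)} \cdot \left( 1 + \frac{\epsilon}{2} \right) = 2 \|X\|_{(\mathcal{R},\mathcal{C})} \cdot \left( 1 + \frac{\epsilon}{2} \right) . \qquad (13)$$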



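Next, we need to show that $\sup_{r \in \mathcal{R}} \| \operatorname{diag}(r)^{1/2} A \|_F^2$ is not much larger than $\sup_{r \in \mathcal{R}_0} \| \operatorname{diag}(r)^{1/2} A \|_F^2$ (and same for $B$). Choose any $r' \in \mathcal{R}$, and let

$$r'' = \left(1 - \frac{\delta}{\min_i r^+_i}\right) r' + \frac{\delta}{\min_i r^+_i}\, r^+ \in \mathcal{R}.$$

Then $\min_i r''_i \ge \frac{\delta}{\min_i r^+_i} \cdot \min_i r^+_i = \delta$, and so $r'' \in \mathcal{R}_0$. We also have $r'_i \le \left(1 - \frac{\delta}{\min_{i'} r^+_{i'}}\right)^{-1} r''_i$ for all $i$. Therefore,

$$\left\| \operatorname{diag}(r')^{1/2} A \right\|_F^2 \le \left(1 - \frac{\delta}{\min_i r^+_i}\right)^{-1} \left\| \operatorname{diag}(r'')^{1/2} A \right\|_F^2 \le \left(1 - \frac{\delta}{\min_i r^+_i}\right)^{-1} \sup_{r \in \mathcal{R}_0} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2.$$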
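The mixing step above can be checked numerically. The following is a minimal sketch, assuming small hand-picked vectors `r_plus` and `r_prime` on the simplex (illustrative values, not taken from the paper) and the choice $\delta = \min_i r^+_i \cdot \frac{\epsilon}{2(1+\epsilon)}$; the helper name `shift_into_R0` is hypothetical.

```python
import numpy as np

def shift_into_R0(r_prime, r_plus, delta):
    """Mix r' with the strictly positive point r+ so that every entry is >= delta.

    Returns r'' = (1 - lam) * r' + lam * r+ together with lam = delta / min_i r+_i.
    Assumes 0 <= lam < 1.
    """
    lam = delta / r_plus.min()
    return (1.0 - lam) * r_prime + lam * r_plus, lam

# Illustrative values (assumptions): r' has a zero entry, r+ is strictly positive.
r_plus = np.array([0.2, 0.3, 0.5])
r_prime = np.array([0.0, 0.4, 0.6])
eps = 0.1
delta = r_plus.min() * eps / (2 * (1 + eps))

r_dd, lam = shift_into_R0(r_prime, r_plus, delta)
# r'' stays on the simplex, is bounded below by delta,
# and dominates (1 - lam) * r' entrywise.
```

With this choice of $\delta$, the inflation factor $(1 - \texttt{lam})^{-1}$ equals $(1+\epsilon)/(1+\epsilon/2)$, which is the constant that appears when the bound is combined with (13).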



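Since this is true for any $r' \in \mathcal{R}$, applying the definition of $\delta$, we have

$$\sup_{r \in \mathcal{R}} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 \le \left(1 - \frac{\delta}{\min_i r^+_i}\right)^{-1} \sup_{r \in \mathcal{R}_0} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 \le \left(1 - \frac{\epsilon}{2(1+\epsilon)}\right)^{-1} \sup_{r \in \mathcal{R}_0} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 = \frac{1+\epsilon}{1+\epsilon/2} \sup_{r \in \mathcal{R}_0} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2.$$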



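Applying the same reasoning for $B$ and then plugging in the bound (13), we have

$$\inf_{\tilde{A}\tilde{B}^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| \operatorname{diag}(r)^{1/2} \tilde{A} \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| \operatorname{diag}(c)^{1/2} \tilde{B} \right\|_F^2 \right) \le \sup_{r \in \mathcal{R}} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| \operatorname{diag}(c)^{1/2} B \right\|_F^2$$

$$\le \frac{1+\epsilon}{1+\epsilon/2} \left( \sup_{r \in \mathcal{R}_0} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}_0} \left\| \operatorname{diag}(c)^{1/2} B \right\|_F^2 \right) \le \frac{1+\epsilon}{1+\epsilon/2} \cdot 2 \|X\|_{(\mathcal{R},\mathcal{C})} \left(1 + \frac{\epsilon}{2}\right) = 2 \|X\|_{(\mathcal{R},\mathcal{C})} (1 + \epsilon).$$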







Since this analysis holds for arbitrary $\epsilon > 0$, this proves the desired result, that

$$\inf_{AB^\top = X} \left( \sup_{r \in \mathcal{R}} \left\| \operatorname{diag}(r)^{1/2} A \right\|_F^2 + \sup_{c \in \mathcal{C}} \left\| \operatorname{diag}(c)^{1/2} B \right\|_F^2 \right) \le 2 \|X\|_{(\mathcal{R},\mathcal{C})}.$$

B Proof of Theorem 2



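We follow similar techniques as used by Srebro and Shraibman [1] in their proof of the analogous result for the max norm. We need to show that

$$\operatorname{Conv}\left\{ uv^\top : u \in \mathbb{R}^n, v \in \mathbb{R}^m, \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\} \subseteq \left\{ X : \|X\|_{(\mathcal{R},\mathcal{C})} \le 1 \right\} \subseteq K_G \cdot \operatorname{Conv}\left\{ uv^\top : u \in \mathbb{R}^n, v \in \mathbb{R}^m, \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\}.$$

For the left-hand inclusion, since $\|\cdot\|_{(\mathcal{R},\mathcal{C})}$ is a norm and therefore the constraint $\|X\|_{(\mathcal{R},\mathcal{C})} \le 1$ is convex, it is sufficient to show that $\|uv^\top\|_{(\mathcal{R},\mathcal{C})} \le 1$ for any $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$ with $\|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1$. This is a trivial consequence of the factorization result in Theorem 1.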
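The left-hand inclusion can be illustrated numerically for rank-one matrices. The sketch below assumes the definitions $\|u\|_{\mathcal{R}} = \sup_{r \in \mathcal{R}} \sqrt{\sum_i r_i u_i^2}$ and $\|X\|_{(\mathcal{R},\mathcal{C})} = \sup_{r, c} \| \operatorname{diag}(r)^{1/2} X \operatorname{diag}(c)^{1/2} \|_{\mathrm{tr}}$, and replaces $\mathcal{R}$ and $\mathcal{C}$ by finite lists of weight vectors (a simplification made only so the suprema can be enumerated; the paper's sets are convex). The function names are hypothetical.

```python
import numpy as np

def set_norm(u, W):
    """sup over rows w of W of sqrt(sum_i w_i * u_i^2)."""
    return np.sqrt((W @ u**2).max())

def local_max_norm_finite(X, R, C):
    """sup over listed pairs (r, c) of the trace norm of diag(r)^{1/2} X diag(c)^{1/2}."""
    return max(
        np.linalg.norm(np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :], "nuc")
        for r in R
        for c in C
    )

rng = np.random.default_rng(0)
R = rng.dirichlet(np.ones(4), size=5)   # five weight vectors on the 4-simplex
C = rng.dirichlet(np.ones(3), size=6)   # six weight vectors on the 3-simplex
u = rng.standard_normal(4)
v = rng.standard_normal(3)
u /= set_norm(u, R)                     # normalize so that ||u||_R = 1
v /= set_norm(v, C)                     # normalize so that ||v||_C = 1
value = local_max_norm_finite(np.outer(u, v), R, C)
```

For a rank-one matrix the weighted trace norm factors as $\|\operatorname{diag}(r)^{1/2} u\|_2 \cdot \|\operatorname{diag}(c)^{1/2} v\|_2$, so `value` equals $\|u\|_{\mathcal{R}} \|v\|_{\mathcal{C}} = 1$ up to rounding.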

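Now we prove the right-hand inclusion. Grothendieck's Inequality states that, for any $Y \in \mathbb{R}^{n \times m}$ and for any dimension $k$,

$$\sup\left\{ \langle Y, UV^\top \rangle : U \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{m \times k}, \|U_{(i)}\|_2 \le 1\ \forall i, \|V_{(j)}\|_2 \le 1\ \forall j \right\} \le K_G \cdot \sup\left\{ \langle Y, uv^\top \rangle : u \in \mathbb{R}^n, v \in \mathbb{R}^m, |u_i| \le 1\ \forall i, |v_j| \le 1\ \forall j \right\},$$

where $K_G \in (1.67, 1.79)$ is Grothendieck's constant. We now extend this to a slightly more general form. Take any $a \in \mathbb{R}^n_+$ and $b \in \mathbb{R}^m_+$. Then, setting $\tilde{U} = \operatorname{diag}(a)^+ U$ and $\tilde{V} = \operatorname{diag}(b)^+ V$ (where $M^+$ is the pseudoinverse of $M$), and same for $u$ and $v$, we see that

$$\begin{aligned}
&\sup\left\{ \langle Y, UV^\top \rangle : U \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{m \times k}, \|U_{(i)}\|_2 \le a_i\ \forall i, \|V_{(j)}\|_2 \le b_j\ \forall j \right\} \\
&\quad = \sup\left\{ \langle \operatorname{diag}(a) \cdot Y \cdot \operatorname{diag}(b), \tilde{U}\tilde{V}^\top \rangle : \tilde{U} \in \mathbb{R}^{n \times k}, \tilde{V} \in \mathbb{R}^{m \times k}, \|\tilde{U}_{(i)}\|_2 \le 1\ \forall i, \|\tilde{V}_{(j)}\|_2 \le 1\ \forall j \right\} \\
&\quad \le K_G \cdot \sup\left\{ \langle \operatorname{diag}(a) \cdot Y \cdot \operatorname{diag}(b), \tilde{u}\tilde{v}^\top \rangle : \tilde{u} \in \mathbb{R}^n, \tilde{v} \in \mathbb{R}^m, |\tilde{u}_i| \le 1\ \forall i, |\tilde{v}_j| \le 1\ \forall j \right\} \\
&\quad = K_G \cdot \sup\left\{ \langle Y, uv^\top \rangle : u \in \mathbb{R}^n, v \in \mathbb{R}^m, |u_i| \le a_i\ \forall i, |v_j| \le b_j\ \forall j \right\}. 
\end{aligned} \tag{14}$$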
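The change of variables behind (14) is a row rescaling, and the identity $\langle Y, UV^\top \rangle = \langle \operatorname{diag}(a) Y \operatorname{diag}(b), \tilde{U}\tilde{V}^\top \rangle$ can be verified directly. This is a minimal sketch with hypothetical helper names; it treats rows with $a_i = 0$ the way the pseudoinverse does, by mapping them to zero.

```python
import numpy as np

def rescale_rows(M, w):
    """Return diag(w)^+ M: divide row i by w_i, and set rows with w_i = 0 to zero."""
    out = np.zeros_like(M, dtype=float)
    nz = w > 0
    out[nz] = M[nz] / w[nz, None]
    return out

def inner(Y, Z):
    """Trace inner product <Y, Z> = sum_ij Y_ij Z_ij."""
    return float(np.sum(Y * Z))

def random_rows_bounded_by(w, k, rng):
    """Random matrix whose i-th row has Euclidean norm at most w_i."""
    G = rng.standard_normal((w.size, k))
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    return G * (w * rng.random(w.size))[:, None]
```

If `U` and `V` have row norms bounded by `a` and `b`, then `rescale_rows(U, a)` and `rescale_rows(V, b)` have row norms at most one, which is exactly the form required by Grothendieck's Inequality.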



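Now take any $Y \in \mathbb{R}^{n \times m}$. Let $\|\cdot\|^*_{(\mathcal{R},\mathcal{C})}$ be the dual norm to the $(\mathcal{R},\mathcal{C})$-norm. To bound this dual norm of $Y$, we apply the factorization result of Theorem 1:

$$\begin{aligned}
\|Y\|^*_{(\mathcal{R},\mathcal{C})} &= \sup_{\|X\|_{(\mathcal{R},\mathcal{C})} \le 1} \langle Y, X \rangle \\
&= \sup_{U,V} \left\{ \langle Y, UV^\top \rangle : \frac{1}{2} \left( \sup_{r \in \mathcal{R}} \sum_i r_i \|U_{(i)}\|_2^2 + \sup_{c \in \mathcal{C}} \sum_j c_j \|V_{(j)}\|_2^2 \right) \le 1 \right\} \\
&\overset{(*)}{=} \sup_{U,V} \left\{ \langle Y, UV^\top \rangle : \sup_{r \in \mathcal{R}} \sum_i r_i \|U_{(i)}\|_2^2 = \sup_{c \in \mathcal{C}} \sum_j c_j \|V_{(j)}\|_2^2 \le 1 \right\} \\
&= \sup_{\substack{a \in \mathbb{R}^n_+ : \|a\|_{\mathcal{R}} \le 1 \\ b \in \mathbb{R}^m_+ : \|b\|_{\mathcal{C}} \le 1}} \ \sup_{U,V} \left\{ \langle Y, UV^\top \rangle : \|U_{(i)}\|_2 \le a_i\ \forall i, \|V_{(j)}\|_2 \le b_j\ \forall j \right\} \\
&\le K_G \cdot \sup_{\substack{a \in \mathbb{R}^n_+ : \|a\|_{\mathcal{R}} \le 1 \\ b \in \mathbb{R}^m_+ : \|b\|_{\mathcal{C}} \le 1}} \ \sup_{u,v} \left\{ \langle Y, uv^\top \rangle : |u_i| \le a_i\ \forall i, |v_j| \le b_j\ \forall j \right\} \\
&= K_G \cdot \sup_{u,v} \left\{ \langle Y, uv^\top \rangle : \|u\|_{\mathcal{R}} \le 1, \|v\|_{\mathcal{C}} \le 1 \right\} \\
&= K_G \cdot \sup_X \left\{ \langle Y, X \rangle : X \in \operatorname{Conv}\left\{ uv^\top : u \in \mathbb{R}^n, v \in \mathbb{R}^m, \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\} \right\} \\
&= \sup_X \left\{ \langle Y, X \rangle : X \in K_G \cdot \operatorname{Conv}\left\{ uv^\top : u \in \mathbb{R}^n, v \in \mathbb{R}^m, \|u\|_{\mathcal{R}} = \|v\|_{\mathcal{C}} = 1 \right\} \right\},
\end{aligned}$$

where the inequality is an application of (14).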



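As in [1], this is sufficient to prove the result. Above, the step marked (*) is true because, given any $U$ and $V$ with

$$\frac{1}{2} \left( \sup_{r \in \mathcal{R}} \sum_i r_i \|U_{(i)}\|_2^2 + \sup_{c \in \mathcal{C}} \sum_j c_j \|V_{(j)}\|_2^2 \right) \le 1,$$

we can replace $U$ and $V$ with $U' := U \cdot \omega$ and $V' := V \cdot \omega^{-1}$, where

$$\omega := \sqrt[4]{\frac{\sup_{c \in \mathcal{C}} \sum_j c_j \|V_{(j)}\|_2^2}{\sup_{r \in \mathcal{R}} \sum_i r_i \|U_{(i)}\|_2^2}}.$$

This will give $U'V'^\top = UV^\top$, and

$$\sup_{r \in \mathcal{R}} \sum_i r_i \|U'_{(i)}\|_2^2 = \sup_{c \in \mathcal{C}} \sum_j c_j \|V'_{(j)}\|_2^2 = \sqrt{\sup_{r \in \mathcal{R}} \sum_i r_i \|U_{(i)}\|_2^2 \cdot \sup_{c \in \mathcal{C}} \sum_j c_j \|V_{(j)}\|_2^2} \le 1.$$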
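The balancing argument for step (*) is easy to reproduce numerically. The sketch below assumes finite lists of weight vectors standing in for $\mathcal{R}$ and $\mathcal{C}$ (a simplification made only so the suprema are computable by enumeration) and uses hypothetical function names.

```python
import numpy as np

def weighted_sup(M, W):
    """sup over rows w of W of sum_i w_i * ||M_(i)||_2^2."""
    return float((W @ np.sum(M**2, axis=1)).max())

def balance(U, V, R, C):
    """Rescale U by omega and V by 1/omega so both weighted suprema coincide.

    omega is the fourth root of the ratio of the two suprema, so U' V'^T = U V^T.
    Assumes the supremum for U is strictly positive.
    """
    s_u = weighted_sup(U, R)
    s_v = weighted_sup(V, C)
    omega = (s_v / s_u) ** 0.25
    return U * omega, V / omega
```

After balancing, both suprema equal the geometric mean of the original two values, which is at most their arithmetic mean and hence at most 1 under the stated constraint.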



C Proof of Theorem 3 

Following the strategy of Srebro & Shraibman (2005), we will use the Rademacher complexity to
bound this excess risk. By Theorem 8 of Bartlett & Mendelson (2002),¹ we know that



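$$\mathbb{E}_S\left[ \sum_{ij} p_{ij} \left| Y_{ij} - \hat{X}_{ij} \right| \right] - \inf_{\|X\|_{(\mathcal{R},\mathcal{C})} \le \sqrt{k}} \sum_{ij} p_{ij} \left| Y_{ij} - X_{ij} \right| = O\left( \mathbb{E}_S\left[ \widehat{\mathcal{R}}_S\left( \left\{ X \in \mathbb{R}^{n \times m} : \|X\|_{(\mathcal{R},\mathcal{C})} \le \sqrt{k} \right\} \right) \right] \right), \tag{15}$$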



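where the expected Rademacher complexity is defined as

$$\mathbb{E}_S\left[ \widehat{\mathcal{R}}_S\left( \left\{ X \in \mathbb{R}^{n \times m} : \|X\|_{(\mathcal{R},\mathcal{C})} \le \sqrt{k} \right\} \right) \right] := \frac{1}{s}\, \mathbb{E}_{S,\nu}\left[ \sup_{\|X\|_{(\mathcal{R},\mathcal{C})} \le \sqrt{k}} \sum_{t=1}^{s} \nu_t \cdot X_{i_t j_t} \right],$$

where $(i_t, j_t)$ denotes the $t$-th sampled entry in $S$ and $\nu \in \{\pm 1\}^s$ is a random vector of independent unbiased signs, generated independently from $S$.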

Now we bound the Rademacher complexity. By scaling, it is sufficient to consider the case $k = 1$. The main idea for this proof is to first show that, for any $X$ with $\|X\|_{(\mathcal{R},\mathcal{C})} \le 1$, we can decompose $X$ into a sum $X' + X''$ where $\|X'\|_{\max} \le K_G$ and $\|X''\|_{\mathrm{tr}(\tilde{p})} \le 2 K_G \gamma^{-1/2}$, where $\tilde{p}$ represents the smoothed row and column marginals with smoothing parameter $\zeta = 1/2$, and where $K_G \le 1.79$ is Grothendieck's constant. We will then use known Rademacher complexity bounds for the classes of matrices that have bounded max norm and bounded smoothed weighted trace norm.

To construct the decomposition of X, we start with a vector decomposition lemma, proved below. 

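Lemma 1. Suppose $\mathcal{R} \supseteq \{ r \in \Delta_{[n]} : r_i \le \gamma \cdot \tilde{p}_{i\bullet}\ \forall i \}$. Then for any $u \in \mathbb{R}^n$ with $\|u\|_{\mathcal{R}} = 1$, we can decompose $u$ into a sum $u = u' + u''$ such that $\|u'\|_\infty \le 1$ and $\|u''\|_{\tilde{p}_{\mathrm{row}}} := \sqrt{\sum_i \tilde{p}_{i\bullet} u_i''^2} \le \gamma^{-1/2}$.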

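Next, by Theorem 2, we can write

$$X = K_G \cdot \sum_{l=1}^{\infty} t_l \cdot u_l v_l^\top,$$

where $t_l \ge 0$, $\sum_{l=1}^{\infty} t_l = 1$, and $\|u_l\|_{\mathcal{R}} = \|v_l\|_{\mathcal{C}} = 1$ for all $l$. Applying Lemma 1 to $u_l$ and to $v_l$ for each $l$, we can write $u_l = u_l' + u_l''$ and $v_l = v_l' + v_l''$, where

$$\|u_l'\|_\infty \le 1, \quad \|u_l''\|_{\tilde{p}_{\mathrm{row}}} \le \gamma^{-1/2}, \quad \|v_l'\|_\infty \le 1, \quad \|v_l''\|_{\tilde{p}_{\mathrm{col}}} \le \gamma^{-1/2}.$$

Then

$$X = K_G \cdot \left( \underbrace{\sum_{l=1}^{\infty} t_l \cdot u_l' v_l'^\top}_{=: X_1} + \underbrace{\sum_{l=1}^{\infty} t_l \cdot u_l' v_l''^\top}_{=: X_2} + \underbrace{\sum_{l=1}^{\infty} t_l \cdot u_l'' v_l^\top}_{=: X_3} \right).$$

Furthermore, $\|u_l'\|_{\tilde{p}_{\mathrm{row}}} \le \|u_l'\|_\infty \le 1$, and $\|v_l\|_{\tilde{p}_{\mathrm{col}}} \le \|v_l\|_{\mathcal{C}} \le 1$. Applying Srebro and Shraibman [1]'s convex hull bounds for the trace norm and max norm (stated in Section 4 of the main paper), we see that $\|X_1\|_{\max} \le 1$, and that $\|X_i\|_{\mathrm{tr}(\tilde{p})} \le \gamma^{-1/2}$ for $i = 2, 3$. Defining $X' = K_G \cdot X_1$ and $X'' = K_G \cdot (X_2 + X_3)$, we have the desired decomposition.

¹ The statement of Theorem 8 of Bartlett & Mendelson (2002) gives a result that holds with high probability, but in the proof of this result they derive a bound in expectation, which we use here.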

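Applying this result to every $X$ in the class $\{ X \in \mathbb{R}^{n \times m} : \|X\|_{(\mathcal{R},\mathcal{C})} \le 1 \}$, we see that

$$\begin{aligned}
\mathbb{E}_S\left[ \widehat{\mathcal{R}}_S\left( \left\{ X \in \mathbb{R}^{n \times m} : \|X\|_{(\mathcal{R},\mathcal{C})} \le 1 \right\} \right) \right]
&\le \mathbb{E}_S\left[ \widehat{\mathcal{R}}_S\left( \left\{ X' : \|X'\|_{\max} \le K_G \right\} \right) \right] + \mathbb{E}_S\left[ \widehat{\mathcal{R}}_S\left( \left\{ X'' : \|X''\|_{\mathrm{tr}(\tilde{p})} \le K_G \cdot 2\gamma^{-1/2} \right\} \right) \right] \\
&\le K_G \cdot O\left( \sqrt{\frac{n}{s}} \right) + K_G \cdot 2\gamma^{-1/2} \cdot O\left( \sqrt{\frac{n \log(n)}{s}} + \frac{n \log(n)}{s} \right),
\end{aligned}$$

where the last step uses bounds on the Rademacher complexity of the max norm and weighted trace norm unit balls, shown in Theorem 5 of [1] and Theorem 3 of [6], respectively. Finally, we want to deal with the last term, $\frac{n \log(n)}{s}$, that is outside the square root. Since $s \ge n$ by assumption, we have $\frac{n \log(n)}{s} \le \sqrt{\frac{n \log^2(n)}{s}}$, and if $s \ge n \log(n)$, then we can improve this to $\frac{n \log(n)}{s} \le \sqrt{\frac{n \log(n)}{s}}$.

Returning to (15) and plugging in our bound on the Rademacher complexity, this proves the desired bound on the excess risk.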

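C.1 Proof of Lemma 1

For $u \in \mathbb{R}^n$ with $\|u\|_{\mathcal{R}} = 1$, we need to find a decomposition $u = u' + u''$ such that $\|u'\|_\infty \le 1$ and $\|u''\|^2_{\tilde{p}_{\mathrm{row}}} = \sum_i \tilde{p}_{i\bullet} u_i''^2 \le \gamma^{-1}$. Without loss of generality, assume $|u_1| \ge \dots \ge |u_n|$. Find $N \in \{1, \dots, n\}$ and $t \in (0, 1]$ so that $\sum_{i=1}^{N-1} \tilde{p}_{i\bullet} + t \cdot \tilde{p}_{N\bullet} = \gamma^{-1}$, and let

$$r = \gamma \cdot \left( \tilde{p}_{1\bullet}, \dots, \tilde{p}_{(N-1)\bullet}, t \cdot \tilde{p}_{N\bullet}, 0, \dots, 0 \right) \in \Delta_{[n]}.$$

Clearly, $r_i \le \gamma \cdot \tilde{p}_{i\bullet}$ for all $i$, and so $r \in \mathcal{R}$.

Now let $u'' = (u_1, \dots, u_{N-1}, \sqrt{t} \cdot u_N, 0, \dots, 0)$, and set $u' = u - u''$. We then calculate

$$\|u''\|^2_{\tilde{p}_{\mathrm{row}}} = \sum_{i=1}^{N-1} \tilde{p}_{i\bullet} u_i^2 + t \cdot \tilde{p}_{N\bullet} u_N^2 = \gamma^{-1} \sum_{i=1}^{n} r_i u_i^2 \le \gamma^{-1} \|u\|^2_{\mathcal{R}} = \gamma^{-1}.$$

Finally, we want to show that $\|u'\|_\infty \le 1$. Since $u_i' = 0$ for $i < N$, we only need to bound $|u_i'|$ for each $i \ge N$. We have

$$u_i'^2 \le u_i^2 = \sum_{i'=1}^{n} r_{i'} u_i^2 \overset{(\#)}{=} \sum_{i'=1}^{N} r_{i'} u_i^2 \overset{(*)}{\le} \sum_{i'=1}^{N} r_{i'} u_{i'}^2 \le \|u\|^2_{\mathcal{R}} = 1,$$

where the step marked (*) uses the fact that $|u_{i'}| \ge |u_i|$ for all $i' \le N$, and the step marked (#) comes from the fact that $r$ is supported on $\{1, \dots, N\}$. This is sufficient.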
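The construction in this proof is a greedy fill, and it can be written out in a few lines. The sketch below assumes that $\mathcal{R}$ is exactly the set $\{ r \in \Delta_{[n]} : r_i \le \gamma \tilde{p}_{i\bullet} \}$, that `p` is a strictly positive probability vector, and that $\gamma \ge 1$; under these assumptions the greedy vector `r` also attains the supremum defining $\|u\|_{\mathcal{R}}$. The function names are hypothetical.

```python
import numpy as np

def greedy_weights(u, p, gamma):
    """Greedy r on the simplex with r_i <= gamma * p_i, filling the largest |u_i| first."""
    order = np.argsort(-np.abs(u), kind="stable")
    r = np.zeros_like(p, dtype=float)
    remaining = 1.0
    for i in order:
        r[i] = min(gamma * p[i], remaining)
        remaining -= r[i]
        if remaining <= 0:
            break
    return r

def decompose(u, p, gamma):
    """Split u = u' + u'' following the proof of Lemma 1.

    u'' keeps the coordinates carrying full weight, scales the boundary
    coordinate by sqrt(t), and zeroes the rest; u' is the remainder.
    """
    r = greedy_weights(u, p, gamma)
    scale = np.sqrt(r / (gamma * p))  # 1, ..., 1, sqrt(t), 0, ..., 0 in sorted order
    u_dd = scale * u
    return u - u_dd, u_dd, r
```

When `u` is normalized so that `np.sum(r * u**2) == 1`, the weighted norm of `u''` satisfies $\sum_i \tilde{p}_{i\bullet} u_i''^2 = \gamma^{-1}$ and every entry of `u'` has magnitude at most one.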

D Proof of Proposition 1 

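Let $L_0 = \mathrm{Loss}(\hat{X})$. Then, by definition,

$$\hat{X} = \arg\min_X \left\{ \mathrm{Penalty}_{(\beta,\tau)}(X) : \mathrm{Loss}(X) \le L_0 \right\}.$$

Then to prove the proposition, it is sufficient to show that for some $t \in [0, 1]$,

$$\hat{X} = \arg\min_X \left\{ \|X\|_{(\mathcal{R}^{(t)},\mathcal{C}^{(t)})} : \mathrm{Loss}(X) \le L_0 \right\},$$

where we set

$$\mathcal{R}^{(t)} = \left\{ r \in \Delta_{[n]} : r_i \ge \frac{t}{1 + (n-1) \cdot t}\ \forall i \right\}, \quad \mathcal{C}^{(t)} = \left\{ c \in \Delta_{[m]} : c_j \ge \frac{t}{1 + (m-1) \cdot t}\ \forall j \right\}.$$

Trivially, we can rephrase these definitions as

$$\mathcal{R}^{(t)} = \left\{ \frac{1-t}{1 + (n-1) \cdot t} \cdot r + \frac{t}{1 + (n-1) \cdot t} \cdot (1, \dots, 1) : r \in \Delta_{[n]} \right\}, \quad \mathcal{C}^{(t)} = \left\{ \frac{1-t}{1 + (m-1) \cdot t} \cdot c + \frac{t}{1 + (m-1) \cdot t} \cdot (1, \dots, 1) : c \in \Delta_{[m]} \right\}.$$

Note that for any vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$,

$$\sup_{r \in \Delta_{[n]}} \sum_i r_i u_i = \max_i u_i \quad \text{and} \quad \sup_{c \in \Delta_{[m]}} \sum_j c_j v_j = \max_j v_j. \tag{17}$$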
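Combining the rephrased definition of $\mathcal{R}^{(t)} = \{ r \in \Delta_{[n]} : r_i \ge t / (1 + (n-1)t) \}$ with (17) gives a closed form for the supremum of a linear functional over $\mathcal{R}^{(t)}$, namely $\big(t \sum_i u_i + (1-t) \max_i u_i\big) / \big(1 + (n-1)t\big)$. The following sketch checks this against the vertices of $\mathcal{R}^{(t)}$; the function names are hypothetical.

```python
import numpy as np

def sup_over_R_t(u, t):
    """Closed form for sup over r in R^(t) of sum_i r_i u_i."""
    n = u.size
    return (t * u.sum() + (1.0 - t) * u.max()) / (1.0 + (n - 1) * t)

def vertices_R_t(n, t):
    """Rows are the images of the simplex vertices e_k under
    r -> ((1 - t) * r + t * (1, ..., 1)) / (1 + (n - 1) * t)."""
    return ((1.0 - t) * np.eye(n) + t) / (1.0 + (n - 1) * t)
```

Since a linear functional on a polytope is maximized at a vertex, `(vertices_R_t(n, t) @ u).max()` agrees with `sup_over_R_t(u, t)`. Setting `t = 0` recovers the maximum entry and `t = 1` recovers the average, which is the interpolation between max-norm-like and trace-norm-like weighting.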

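Applying the SDP formulation of the local max norm (proved in Lemma 2 below), we have

$$\begin{aligned}
\|X\|_{(\mathcal{R}^{(t)},\mathcal{C}^{(t)})}
&= \frac{1}{2} \inf_{U,V} \left\{ \sup_{r \in \mathcal{R}^{(t)}} \sum_i r_i U_{ii} + \sup_{c \in \mathcal{C}^{(t)}} \sum_j c_j V_{jj} : \begin{pmatrix} U & X \\ X^\top & V \end{pmatrix} \succeq 0 \right\} \\
&= \frac{1}{2} \inf_{U,V} \left\{ \frac{t \sum_i U_{ii} + (1-t) \max_i U_{ii}}{1 + (n-1) \cdot t} + \frac{t \sum_j V_{jj} + (1-t) \max_j V_{jj}}{1 + (m-1) \cdot t} : \begin{pmatrix} U & X \\ X^\top & V \end{pmatrix} \succeq 0 \right\} \\
&= \frac{1}{2 \sqrt{(1 + (n-1) \cdot t)(1 + (m-1) \cdot t)}} \inf_{A,B} \left\{ t \sum_i A_{ii} + (1-t) \max_i A_{ii} + t \sum_j B_{jj} + (1-t) \max_j B_{jj} : \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\} \\
&= \frac{1}{2 \sqrt{(1 + (n-1) \cdot t)(1 + (m-1) \cdot t)}} \inf_{A,B} \left\{ (1-t) \cdot M(A,B) + t \cdot T(A,B) : X \in \mathcal{X}_{A,B} \right\},
\end{aligned} \tag{18}$$

where the second step uses the rephrased definitions of $\mathcal{R}^{(t)}$ and $\mathcal{C}^{(t)}$ together with (17). For the next-to-last step, we define

$$A = U \cdot \sqrt{\frac{1 + (m-1) \cdot t}{1 + (n-1) \cdot t}}, \quad B = V \cdot \sqrt{\frac{1 + (n-1) \cdot t}{1 + (m-1) \cdot t}},$$

a rescaling that preserves positive semidefiniteness of the block matrix, and for the last step, we define

$$T(A,B) = \operatorname{trace}(A) + \operatorname{trace}(B), \quad M(A,B) = \max_i A_{ii} + \max_j B_{jj},$$

and

$$\mathcal{X}_{A,B} = \left\{ X : \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\}.$$
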
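Next, we compare this to the $(\beta, \tau)$ penalty formulated in our main paper. Recall

$$\mathrm{Penalty}_{(\beta,\tau)}(X) = \inf_{AB^\top = X} \left\{ \sqrt{\max_i \|A_{(i)}\|_2^2 + \max_j \|B_{(j)}\|_2^2} \cdot \sqrt{\sum_i \|A_{(i)}\|_2^2 + \sum_j \|B_{(j)}\|_2^2} \right\}.$$

Applying Lemma 3 below, we can obtain an equivalent SDP formulation of the penalty

$$\mathrm{Penalty}_{(\beta,\tau)}(X) = \inf_{A,B} \left\{ \sqrt{M(A,B)} \cdot \sqrt{T(A,B)} : X \in \mathcal{X}_{A,B} \right\}. \tag{19}$$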

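Since $M(A,B) \le T(A,B) \le \max\{n, m\} \cdot M(A,B)$, and since for any $x, y \ge 0$ we know $\sqrt{xy} \le \frac{1}{2}\left( \alpha \cdot x + \alpha^{-1} \cdot y \right)$ for any $\alpha > 0$, with equality attained when $\alpha = \sqrt{y/x}$, we see that

$$\begin{aligned}
\mathrm{Penalty}_{(\beta,\tau)}(X)
&= \frac{1}{2} \inf_{A,B} \left\{ \inf_{\alpha \in [1, \sqrt{\max\{n,m\}}]} \left\{ \alpha \cdot M(A,B) + \alpha^{-1} \cdot T(A,B) \right\} : X \in \mathcal{X}_{A,B} \right\} \\
&= \frac{1}{2} \inf_{\alpha \in [1, \sqrt{\max\{n,m\}}]} \left[ \inf_{A,B} \left\{ \alpha \cdot M(A,B) + \alpha^{-1} \cdot T(A,B) : X \in \mathcal{X}_{A,B} \right\} \right].
\end{aligned}$$

Since the quantity inside the square brackets is nonnegative and is continuous in $\alpha$, and we are minimizing over $\alpha$ in a compact set, for $X = \hat{X}$ the infimum is attained at some $\hat{\alpha}$, so we can write

$$\mathrm{Penalty}_{(\beta,\tau)}(\hat{X}) = \frac{1}{2} \inf_{A,B} \left\{ \hat{\alpha} \cdot M(A,B) + \hat{\alpha}^{-1} \cdot T(A,B) : \hat{X} \in \mathcal{X}_{A,B} \right\}.$$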
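The variational form of the geometric mean used above, $\sqrt{xy} = \frac{1}{2} \min_{\alpha > 0} \left( \alpha x + \alpha^{-1} y \right)$ with minimizer $\alpha = \sqrt{y/x}$, can be confirmed on a grid. This is a small illustrative sketch; the function name and the numeric values are assumptions chosen for the example.

```python
import numpy as np

def half_min_over_alpha(x, y, alphas):
    """Evaluate 0.5 * min over the given alpha grid of (alpha * x + y / alpha)."""
    return 0.5 * float(np.min(alphas * x + y / alphas))

# Illustrative values: x plays the role of M(A, B), y the role of T(A, B),
# so x <= y and the minimizer sqrt(y / x) = 3 is at least 1.
x, y = 2.0, 18.0
alphas = np.linspace(1.0, 4.0, 3001)
approx = half_min_over_alpha(x, y, alphas)
exact = np.sqrt(x * y)
```

Because $M(A,B) \le T(A,B) \le \max\{n,m\} \cdot M(A,B)$, the minimizer $\sqrt{T/M}$ always falls in $[1, \sqrt{\max\{n,m\}}]$, which is why the infimum over $\alpha$ can be restricted to that compact interval.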

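Recall that $\hat{X}$ minimizes $\mathrm{Penalty}_{(\beta,\tau)}(X)$ subject to the constraint $\mathrm{Loss}(X) \le L_0$. Setting $t := \frac{\hat{\alpha}^{-1}}{\hat{\alpha} + \hat{\alpha}^{-1}} \in [0, 1]$, we get

$$\begin{aligned}
\hat{X} &\in \arg\min_X \left\{ \inf_{A,B} \left\{ \hat{\alpha} \cdot M(A,B) + \hat{\alpha}^{-1} \cdot T(A,B) : X \in \mathcal{X}_{A,B} \right\} : \mathrm{Loss}(X) \le L_0 \right\} \\
&= \arg\min_X \left\{ \inf_{A,B} \left\{ \frac{\hat{\alpha}}{\hat{\alpha} + \hat{\alpha}^{-1}} \cdot M(A,B) + \frac{\hat{\alpha}^{-1}}{\hat{\alpha} + \hat{\alpha}^{-1}} \cdot T(A,B) : X \in \mathcal{X}_{A,B} \right\} : \mathrm{Loss}(X) \le L_0 \right\} \\
&= \arg\min_X \left\{ \inf_{A,B} \left\{ (1-t) \cdot M(A,B) + t \cdot T(A,B) : X \in \mathcal{X}_{A,B} \right\} : \mathrm{Loss}(X) \le L_0 \right\} \\
&= \arg\min_X \left\{ \|X\|_{(\mathcal{R}^{(t)},\mathcal{C}^{(t)})} : \mathrm{Loss}(X) \le L_0 \right\},
\end{aligned}$$

as desired, where the last step uses (18) and the fact that rescaling the objective by a positive constant does not change its minimizers.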



E Computing the local max norm with an SDP 

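Lemma 2. Suppose $\mathcal{R}$ and $\mathcal{C}$ are convex, and are defined by SDP-representable constraints. Then the $(\mathcal{R},\mathcal{C})$-norm can be calculated with the semidefinite program

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf \left\{ \sup_{r \in \mathcal{R}} \sum_i r_i A_{ii} + \sup_{c \in \mathcal{C}} \sum_j c_j B_{jj} : \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\}.$$

In the special case where $\mathcal{R}$ and $\mathcal{C}$ are defined as in (8) in the main paper, the norm is given by

$$\|X\|_{(\mathcal{R},\mathcal{C})} = \frac{1}{2} \inf \left\{ a + R^\top a_1 + b + C^\top b_1 : a_{1i} \ge 0 \text{ and } a + a_{1i} \ge A_{ii}\ \forall i,\ b_{1j} \ge 0 \text{ and } b + b_{1j} \ge B_{jj}\ \forall j,\ \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\}.$$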



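Proof. For the general case, based on Theorem 1 in the main paper, we only need to show that

$$\inf \left\{ \sup_{r \in \mathcal{R}} \sum_i r_i A_{ii} + \sup_{c \in \mathcal{C}} \sum_j c_j B_{jj} : \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\} = \inf \left\{ \sup_{r \in \mathcal{R}} \sum_i r_i \|A_{(i)}\|_2^2 + \sup_{c \in \mathcal{C}} \sum_j c_j \|B_{(j)}\|_2^2 : AB^\top = X \right\}.$$

This is proved in Lemma 3 below.

For the special case where $\mathcal{R}$ and $\mathcal{C}$ are defined by element-wise bounds, we return to the proof of Theorem 1 given in Section A, where we see that

$$2 \|X\|_{(\mathcal{R},\mathcal{C})} = \inf_{\substack{AB^\top = X,\ a, b \in \mathbb{R}, \\ a_1 \in \mathbb{R}^n,\ b_1 \in \mathbb{R}^m}} \left\{ a + R^\top a_1 + b + C^\top b_1 : a_{1i} \ge 0 \text{ and } a + a_{1i} \ge \|A_{(i)}\|_2^2\ \forall i,\ b_{1j} \ge 0 \text{ and } b + b_{1j} \ge \|B_{(j)}\|_2^2\ \forall j \right\}.$$

Noting that $\|A_{(i)}\|_2^2 = (AA^\top)_{ii}$ and $\|B_{(j)}\|_2^2 = (BB^\top)_{jj}$, we again use Lemma 3 to see that this is equivalent to the SDP

$$\inf \left\{ a + R^\top a_1 + b + C^\top b_1 : a_{1i} \ge 0 \text{ and } a + a_{1i} \ge A_{ii}\ \forall i,\ b_{1j} \ge 0 \text{ and } b + b_{1j} \ge B_{jj}\ \forall j,\ \begin{pmatrix} A & X \\ X^\top & B \end{pmatrix} \succeq 0 \right\}.$$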

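□

Lemma 3. Let $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ be any function that is nondecreasing in each coordinate and let $X \in \mathbb{R}^{n \times m}$ be any matrix. Then

$$\inf \left\{ f\left( \|A_{(1)}\|_2^2, \dots, \|A_{(n)}\|_2^2, \|B_{(1)}\|_2^2, \dots, \|B_{(m)}\|_2^2 \right) : AB^\top = X \right\} = \inf \left\{ f\left( \Phi_{11}, \dots, \Phi_{nn}, \Psi_{11}, \dots, \Psi_{mm} \right) : \begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0 \right\},$$

where the factorization $AB^\top = X$ is assumed to be of arbitrary dimension, that is, $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{m \times k}$ for arbitrary $k \in \mathbb{N}$.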

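Proof. We follow similar arguments as in Lemma 14 in [16], where this equality is shown for the special case of calculating a trace norm.

For convenience, we write

$$g(A, B) = f\left( \|A_{(1)}\|_2^2, \dots, \|A_{(n)}\|_2^2, \|B_{(1)}\|_2^2, \dots, \|B_{(m)}\|_2^2 \right)$$

and

$$h(\Phi, \Psi) = f\left( \Phi_{11}, \dots, \Phi_{nn}, \Psi_{11}, \dots, \Psi_{mm} \right).$$

Then we would like to show that

$$\inf \left\{ g(A, B) : AB^\top = X \right\} = \inf \left\{ h(\Phi, \Psi) : \begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0 \right\}.$$

First, take any factorization $AB^\top = X$. Let $\Phi = AA^\top$ and $\Psi = BB^\top$. Then $\begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0$, and we have $g(A, B) = h(\Phi, \Psi)$ by definition. Therefore,

$$\inf \left\{ g(A, B) : AB^\top = X \right\} \ge \inf \left\{ h(\Phi, \Psi) : \begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0 \right\}.$$

Next, take any $\Phi$ and $\Psi$ such that $\begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0$. Take a Cholesky decomposition

$$\begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix} \begin{pmatrix} A & 0 \\ B & C \end{pmatrix}^\top = \begin{pmatrix} AA^\top & AB^\top \\ BA^\top & BB^\top + CC^\top \end{pmatrix}.$$

From this, we see that $AB^\top = X$, that $\Phi_{ii} = \|A_{(i)}\|_2^2$ for all $i$, and that $\Psi_{jj} \ge \|B_{(j)}\|_2^2$ for all $j$. Since $f$ is nondecreasing in each coordinate, we have $h(\Phi, \Psi) \ge g(A, B)$. Therefore, we see that

$$\inf \left\{ g(A, B) : AB^\top = X \right\} \le \inf \left\{ h(\Phi, \Psi) : \begin{pmatrix} \Phi & X \\ X^\top & \Psi \end{pmatrix} \succeq 0 \right\}.$$

□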
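The second half of this proof is constructive: a Cholesky factor of the block matrix yields a factorization of $X$ whose squared row norms are dominated by the diagonals of $\Phi$ and $\Psi$. The sketch below illustrates this with NumPy, assuming the block matrix is strictly positive definite (which `np.linalg.cholesky` requires; the merely semidefinite case would need a different factorization routine). The function name is hypothetical.

```python
import numpy as np

def factor_from_psd(Phi, X, Psi):
    """Extract A (n x n) and B (m x n) with A @ B.T == X from a Cholesky factor.

    The lower-triangular factor of [[Phi, X], [X.T, Psi]] has the block form
    [[A, 0], [B, C]], so Phi = A A^T, X = A B^T and Psi = B B^T + C C^T.
    """
    n = Phi.shape[0]
    block = np.block([[Phi, X], [X.T, Psi]])
    L = np.linalg.cholesky(block)  # requires strict positive definiteness
    return L[:n, :n], L[n:, :n]

# Build a random positive definite block matrix and read off Phi, X, Psi.
rng = np.random.default_rng(0)
n, m = 3, 2
G = rng.standard_normal((n + m, n + m + 2))
P = G @ G.T
Phi, X, Psi = P[:n, :n], P[:n, n:], P[n:, n:]
A, B = factor_from_psd(Phi, X, Psi)
```

Here the squared row norms of `A` match the diagonal of `Phi` exactly, while those of `B` are bounded by the diagonal of `Psi`, with the slack given by the rows of the discarded block $C$.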